feat: add gpu process job #102
Open
lzi-a11y wants to merge 27 commits into
In RoCE environments with multi-port NICs (e.g. 8x NICs x 4 ports), the sequential mlxlink collection timed out before completing, resulting in 'No transceiver info available'.
- Parallelize all mlxlink/ethtool calls in Collect() using goroutines instead of sequential loops, reducing wall time from ~48s to ~22s
- Record PCIe BDF for IB-device primary net interfaces during enumeration
- Skip Ethernet interfaces that share a BDF with an already-collected IB interface (filters eth_vf_rep_* duplicates that report the same module)
- Fix RestartSystemdService: use separate contexts for daemon-reload and restart, add --no-block to avoid blocking on Type=notify ready signal
eth_vf_r* SR-IOV VF ethernet interfaces were being enumerated and passed to mlxlink, which took 8-11s to fail per interface (8 VFs x ~10s blocked the collection window). The same physfn check already used for IB VFs is now applied to ethernet interfaces during enumeration. Also cap concurrent mlxlink workers at 8 to reduce PCIe device lock contention when running alongside the daemon.
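The bounded fan-out described in the two commits above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `collectAll` and `probe` are hypothetical names, and `probe` stands in for whatever actually shells out to mlxlink/ethtool. Only the cap of 8 workers comes from the commit message.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// maxWorkers mirrors the cap of 8 concurrent mlxlink workers from the commit message.
const maxWorkers = 8

// collectAll runs probe once per interface with at most maxWorkers calls in flight.
// probe is a hypothetical stand-in for the real mlxlink/ethtool invocation.
func collectAll(ifaces []string, probe func(string) string) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string, len(ifaces))
		sem     = make(chan struct{}, maxWorkers) // counting semaphore
	)
	for _, name := range ifaces {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it when done
			out := probe(name)
			mu.Lock()
			results[name] = out
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return results
}

func main() {
	ifaces := make([]string, 32) // e.g. 8 NICs x 4 ports
	for i := range ifaces {
		ifaces[i] = fmt.Sprintf("eth%d", i)
	}
	start := time.Now()
	res := collectAll(ifaces, func(name string) string {
		time.Sleep(10 * time.Millisecond) // simulated slow tool call
		return "ok:" + name
	})
	fmt.Printf("collected %d interfaces in %s\n", len(res), time.Since(start).Round(time.Millisecond))
}
```

A buffered channel used as a semaphore keeps the goroutine-per-interface structure simple while still limiting how many tool invocations contend for the PCIe device lock at once.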
Replace hardcoded consts.Red in component PrintInfo with a new LevelColor(level) helper that maps:
- LevelWarning -> Yellow
- LevelCritical / LevelFatal -> Red
- other -> Green
Applied across cpu, dmesg, ethernet, gpfs, gpuevents, infiniband, nvidia, pcie/topotest, podlog, syslog, transceiver so warnings no longer share the same red color as critical/fatal events.
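A helper with the mapping described above might look like the sketch below. The constant names and values are assumptions made for illustration (the real ones live in the project's consts package and may differ); only the level-to-color mapping is taken from the commit message.

```go
package main

import "fmt"

// Hypothetical stand-ins for the level and color constants in consts.
const (
	LevelInfo     = "info"
	LevelWarning  = "warning"
	LevelCritical = "critical"
	LevelFatal    = "fatal"

	Red    = "\033[31m"
	Green  = "\033[32m"
	Yellow = "\033[33m"
	Reset  = "\033[0m"
)

// LevelColor maps an event level to the terminal color used when printing it.
func LevelColor(level string) string {
	switch level {
	case LevelWarning:
		return Yellow
	case LevelCritical, LevelFatal:
		return Red
	default:
		return Green
	}
}

func main() {
	for _, lvl := range []string{LevelInfo, LevelWarning, LevelCritical, LevelFatal} {
		fmt.Printf("%s%s%s\n", LevelColor(lvl), lvl, Reset)
	}
}
```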
Introduce a separate in-cluster collector application that receives latest snapshot.json from each node via HTTP POST and persists one file per node. Node-side reporter lives as a module in sichek daemon; analysis service (outside cluster) fetches via SSH/rsync or optional HTTP GET. Storage is latest-only (no archival), no HA, no DB.
Two plans for the sichek-collector design:
- 2026-04-23-sichek-collector-app.md: standalone new repo. Tasks cover scaffold, config, store interface + FS impl, middleware, POST/GET handlers, main entry, Docker, K8s manifests, E2E test.
- 2026-04-23-sichek-reporter-module.md: integrates a reporter goroutine into the existing sichek daemon. Tasks cover config loader, pushOnce with gzip + retry, ticker loop, node-name resolution, DaemonService wiring, YAML config update.
Both plans are self-contained with exact file paths and code, suitable for subagent-driven execution.
GetPCIETreeMin walked the upstream PCIe path and tracked the minimum width/speed value but threw away which bridge that minimum came from. The checker then had to fall back to printing the IB device's own BDF (e.g. mlx5_5(0000:aa:00.0)), which is misleading because the actual bottleneck is at one of the upstream PCIe bridges. Return the matching BDF alongside the value, store it on IBHardWareInfo as PCIETreeWidthMinBDF / PCIETreeSpeedMinBDF, and surface it in the checker detail/suggestion as "bottleneck@<upstream-bdf>". Verified on bjg45 (positive: bottleneck@0000:a7:01.0 now shown) and on lmg86/thg1/clnet36 (healthy baseline still PASS, no regression).
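The idea of returning the bottleneck's BDF together with the minimum can be illustrated with a small sketch. The `bridge` type and `treeMinWidth` function are hypothetical, and the path below is made-up sample data (only the two BDFs quoted in the commit message are reused); the real GetPCIETreeMin reads these values from sysfs.

```go
package main

import "fmt"

// bridge is a hypothetical record for one hop on the upstream PCIe path.
type bridge struct {
	BDF   string
	Width int // negotiated link width, e.g. 16 for x16
}

// treeMinWidth returns the smallest width on the path and the BDF it came from.
// On ties the first hop in path order wins. ok is false for an empty path.
func treeMinWidth(path []bridge) (width int, bdf string, ok bool) {
	for i, b := range path {
		if i == 0 || b.Width < width {
			width, bdf = b.Width, b.BDF
		}
	}
	return width, bdf, len(path) > 0
}

func main() {
	// Illustrative path from the device up towards the root.
	path := []bridge{
		{BDF: "0000:aa:00.0", Width: 16},
		{BDF: "0000:a7:01.0", Width: 8},
		{BDF: "0000:a6:00.0", Width: 16},
	}
	if w, bdf, ok := treeMinWidth(path); ok {
		fmt.Printf("min width x%d, bottleneck@%s\n", w, bdf)
	}
}
```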
A degraded upstream PCIe path silently caps RDMA throughput well below the HCA's rated speed/width and is not safely ignorable, so the two checks should surface as Critical (cordon-now) rather than Warning (schedule-for-fix).
Why: zy3 (B300 NVL8 / CX8 RoCE) exposes 12 ports per IB device with the data path on ports 3/6/9/12 (eth_rX_p0..p3); the legacy ports/1 hard-coding misreports all 8 cards as DOWN.
What:
- spec: add device_ports/default_ports + (*InfinibandSpec).PortsFor; both default empty so existing single-port clusters stay on port 1.
- collector: ib_hardware_info / ib_counters take a port arg; Collect() emits one record per (ibdev, port) keyed as "<ibdev>/p<port>". Per-port netdev resolved via cached `rdma link` output.
- checkers: per-port reports use "ibdev/pN" labels; per-device checkers (fw, ofed, pcie_*, roce) dedupe on hwInfo.IBDev so multi-plane HCAs aren't reported four times.
- metrics: gauges gain a "port" label; series cleanup keyed by (dev, port).
- spec yaml: add `zy` cluster (8x roce_rX with [3,6,9,12], 4x mezz on port 1) + NVD0000000072 (CX8) and NVD0000000079 (CX7 mezz) HCA specs.
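The port-resolution behaviour listed above can be sketched as follows. The struct field names, the lookup order (per-device entry first, then default_ports, then port 1), and the `recordKey` helper are assumptions for illustration; the commit only states that both settings default to empty, that empty means port 1, and that records are keyed as "<ibdev>/p<port>".

```go
package main

import "fmt"

// InfinibandSpec here is a trimmed, hypothetical version of the real spec type.
type InfinibandSpec struct {
	DevicePorts  map[string][]int // per-device override, e.g. roce_r0 -> [3 6 9 12]
	DefaultPorts []int            // cluster-wide default when no override exists
}

// PortsFor returns the ports to collect for ibdev.
// Assumed precedence: device_ports, then default_ports, then the legacy port 1.
func (s *InfinibandSpec) PortsFor(ibdev string) []int {
	if s != nil {
		if ports, ok := s.DevicePorts[ibdev]; ok && len(ports) > 0 {
			return ports
		}
		if len(s.DefaultPorts) > 0 {
			return s.DefaultPorts
		}
	}
	return []int{1}
}

// recordKey builds the per-port record key "<ibdev>/p<port>".
func recordKey(ibdev string, port int) string {
	return fmt.Sprintf("%s/p%d", ibdev, port)
}

func main() {
	spec := &InfinibandSpec{
		DevicePorts: map[string][]int{"roce_r0": {3, 6, 9, 12}},
	}
	for _, dev := range []string{"roce_r0", "mlx5_0"} {
		for _, port := range spec.PortsFor(dev) {
			fmt.Println(recordKey(dev, port))
		}
	}
}
```

Falling back to port 1 when nothing is configured is what keeps existing single-port clusters unchanged.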
PCIETreeSpeedChecker compared the path-min speed read from sysfs against
hcaSpec.Hardware.PCIESpeed (device link speed) and PCIETreeWidthChecker
likewise used PCIEWidth. This silently worked while every supported board
had link-speed == tree-speed, but breaks on CX8: the card itself links at
PCIe Gen6 (64 GT/s) while the upstream switch caps the tree at 32 GT/s,
so the checker reports a false positive even with a correct
pcie_tree_speed/pcie_tree_width entry in spec.
Switch the tree checkers to read the dedicated PCIETreeSpeedMin /
PCIETreeWidthMin fields, falling back to the device-level value when the
spec omits them (preserves behaviour for older boards).
Also accept loose number formatting between spec and sysfs ("32" vs
"32.0") via a numeric comparison helper, so spec authors don't have to
mirror sysfs's trailing-zero formatting verbatim.
Update the NVD0000000072 (CX8) entry to reflect reality:
pcie_speed=64 GT/s, pcie_tree_speed=32.
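As a rough illustration of the two behaviours described in this commit (falling back to the device-level value when the tree value is omitted, and comparing "32" with "32.0" numerically), consider the sketch below. Both function names are hypothetical, and the inputs are assumed to be bare numeric strings without units.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expectedTreeValue prefers the dedicated tree-level spec value and falls back
// to the device-level value when the spec omits it (hypothetical helper).
func expectedTreeValue(treeSpec, deviceSpec string) string {
	if strings.TrimSpace(treeSpec) != "" {
		return treeSpec
	}
	return deviceSpec
}

// sameNumber reports whether two strings denote the same number, so "32" and
// "32.0" match. Non-numeric input falls back to a trimmed string comparison.
func sameNumber(spec, sysfs string) bool {
	a, errA := strconv.ParseFloat(strings.TrimSpace(spec), 64)
	b, errB := strconv.ParseFloat(strings.TrimSpace(sysfs), 64)
	if errA != nil || errB != nil {
		return strings.TrimSpace(spec) == strings.TrimSpace(sysfs)
	}
	return a == b
}

func main() {
	// CX8-style case: device links at 64, tree is capped at 32.
	want := expectedTreeValue("32", "64")
	fmt.Println(sameNumber(want, "32.0"))

	// Older board with no tree entry in the spec: device value is used.
	want = expectedTreeValue("", "16")
	fmt.Println(sameNumber(want, "16.0"))
}
```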
Two regressions surfaced while running the multi-plane build through field testing on a node with no IB hardware (cl-nctl01 / clnet35):
1. NewInfinibandComponent's user-config fallback path created a default cfg but skipped the cache allocation, so cacheSize stayed 0 and the first LastInfo() call indexed cacheInfo[-1] → panic. Allocate the buffers in that branch so the daemon can keep reporting initError instead of crashing.
2. PrintInfo asserted info.(*InfinibandInfo) unconditionally and printed the opaque "invalid data type" line on the init-error path (info is nil because no health check has cached anything yet). Print the captured CheckerResults instead so operators see why initialization failed (missing spec, no IB hardware, etc.) at a glance.
The collector unconditionally skips IB devices whose name contains "mezz" (`infiniband_info.go::Collect`), so listing them under `zy.ib_devs` had no effect on the per-port hwInfo records produced for the zy cluster. The mezz board id (NVD0000000079) is still required in the top-level `hca:` map so spec validation passes — that part is unchanged.
# Conflicts:
#	components/infiniband/checker/pcie_tree_speed.go
GPUHang has multiple known false-positive sources (pviol thermal-vs-power bug, rxpci/txpci delta semantics, strict 8/8 AND counter reset on any indicator dip). Mute via ignored_checkers until the rule is reworked. See docs/gpu-hang-detection-summary.md for the alignment notes.
device.GetComputeRunningProcesses was already called in DeviceInfo.Get but only its length was kept (as NProcess). Capture the full list now: PID, process name (/proc/<pid>/comm, silent empty on failure), and GPU memory in MiB. NProcess is derived from len(Processes) so its meaning is unchanged. Field appears in snapshot.json under gpu_devices[].compute_processes and in the reporter payload automatically since NvidiaInfo is JSON-marshaled raw. Field-tested on clnet36 (8x H20) with vLLM workers visible as VLLM::Worker_TP at ~92 GiB each.